{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this lab we'll look at images as data. In particular, we'll show how to use standard Python libraries to visualize and distort images, switch between matrix and vector representations, and run machine learning to perform a classification task. We'll be using the ever popular MNIST digit recognition data set. This will be a two part lab, comprised of the following steps:\n",
    "\n",
    "Part1:\n",
    "1. Generate random matrices and visualize as images\n",
    "2. Data Prep and Exploration\n",
    "-- Load MNIST data\n",
    "-- Convert the vectors to a matrix and visualize\n",
    "-- Look at pixel distributions\n",
    "-- Heuristic feature reduction\n",
    "-- Split data into train and validation\n",
    "\n",
    "Part 2:\n",
    "3. Classification\n",
    "-- One vs. All using logisitic regression.\n",
    "-- Random Forests on all\n",
    "4. Error analysis\n",
    "-- Visualize confusion matrix\n",
    "-- Look at specific errors\n",
    "5. Generate synthetic data and rerun RF\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import os, sys\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from matplotlib import pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "\n",
    "#Student: change the data directory to where you have the data\n",
    "data_dir = './data/'\n",
    "data_train = data_dir + 'mnist_train_10k.csv'\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Generate Random Matrices and View as Images"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before looking at actual data, let's introduce the basic image data construct in Python. A color image is generally a 3-dimensional array with dimensions: NxMx3. The last dimension is the color channel, and the first two represent the X and Y grid of the image. The values are typically integer-valued pixel intensities for each color channel, with values between 0 and 255.\n",
    "\n",
    "The first thing we'll do is generate a random image and look at it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Student: generate a 3-dimensional random array of integers in the range of [0,255]\n",
    "x = 256\n",
    "y = 256\n",
    "d = 3\n",
    "\n",
    "\n",
    "rand_img = np.random.randint(#Student put code here)\n",
    "img = plt.imshow(rand_img, interpolation='nearest')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now do it again with a much smaller grid to see how it looks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#another way\n",
    "x = 32\n",
    "y = 32\n",
    "d = 3\n",
    "\n",
    "rand_img = np.random.randint(#Student put code here)\n",
    "img = plt.imshow(rand_img, interpolation='nearest')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Prep and Exploration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Read in the data and take a look at the first few rows\n",
    "train = #Student put code here\n",
    "train.head()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Check to see the size of the data (rows and columns). Remember the first column is the label\n",
    "train.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's do a little prep on the data. We need to split the training data to produce a validation set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "val_pct = 0.1\n",
    "X = train.drop('label', 1)\n",
    "y = train.label\n",
    "X_train, X_val, y_train, y_val = #Student put code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that this data is in a single vector format (i.e., the dimensionality of each record is 1xK, and not in the typical 2-dim photo layout). Before doing any data mining, let's first explore the data visually.\n",
    "\n",
    "The first thing we can do is look at the actual images. What better way to visualize image data?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#First let's build a function that converts a vector into an image matrix and plots it\n",
    "\n",
    "def plot_vector(vec):\n",
    "    '''\n",
    "    Takes in image vector, transforms and plots\n",
    "    '''\n",
    "    #Reshape the images into 28x28 d-dim matrices\n",
    "    v_sq = #Student put code here\n",
    "    \n",
    "    #Now plot\n",
    "    plt.imshow(v_sq, interpolation='nearest', cmap = 'gray')\n",
    "\n",
    "#plot the first few number images along with their label.\n",
    "s = 16\n",
    "fig = plt.figure(figsize=(12,12))\n",
    "for i in range(16):\n",
    "    fig.add_subplot(4, 4, i + 1)\n",
    "    plot_vector(X_train.iloc[i])\n",
    "    plt.title(str(y_train.iloc[i]))\n",
    "    plt.xticks([])\n",
    "    plt.yticks([])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see from the black and white images, many pixels might only have black (x=0) values. These won't be useful for modeling, so we can drop them. Write a function that finds all of the features with stdev=0 and then drop them from the training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#The describe method returns a dataframe with features as columns and distributional stats as rows\n",
    "#hint: use the dataframe.describe() and drop() methods and this can be done in 2 lines\n",
    "def get_no_variance_features(df):\n",
    "    '''\n",
    "    Input: a data frame\n",
    "    Ouput: a list of features with no variance, using the dataframe.describe() method\n",
    "    Note: This can be done in 2 lines\n",
    "    '''\n",
    "    desc_df = #Student put code here\n",
    "    return #Student put code here\n",
    "\n",
    "drop_list = get_no_variance_features(X_train)\n",
    "\n",
    "X_train_filt = X_train.drop(drop_list, 1)\n",
    "X_val_filt = X_val.drop(drop_list, 1)\n",
    "\n",
    "\n",
    "#Compare the shapes of the original and scrubbed dataset\n",
    "X_train.shape, X_train_filt.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now take a look at the distribution of pixel values in a more traditional way. Let's also look at the distribution by class to see if we see any differences. As there are many pixels, let's look at the top 4 by pixel variance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Get the standard deviation of each feature (using describe() as in above)\n",
    "#Sort the results and pull out the top 4 feature names\n",
    "\n",
    "df_std = #Student put code here\n",
    "df_std.sort_values(#Student put code here)\n",
    "top4_list = #Student put code here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Now for each feature plot the histogram by each class label\n",
    "Y_labels = range(10)\n",
    "\n",
    "fig = plt.figure(figsize=(12,8))\n",
    "for i, var in enumerate(top4_list):\n",
    "    ax = fig.add_subplot(2,2,i+1)\n",
    "    ax.set_xlim([0,256])\n",
    "    ax.set_ylim([0,0.5])\n",
    "    for y in Y_labels:\n",
    "        plt.hist(X_train[var][(y_train==y)], bins=100, normed=1, histtype='step')\n",
    "        plt.title(var)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given how polarized the pixels are, this is a bit hard to read. Next let's just look at means by label group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "'''\n",
    "Using the same list of top 4 features by stdev, group the data by the Y variable and take the means of\n",
    "each of the 4 variables by label group.\n",
    "'''\n",
    "\n",
    "#Student put code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Can we start to see how the distributions differ by class?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With our data generally explored and prepped and a little intuition about how pixel values might help us discrimate between the digits, let's start to use our predictive modeling tools.\n",
    "\n",
    "We'll first start with a simple linear model. Also, this is a multi-class scenario so we can't just use and out-of-the-box logistic regression. Luckliy, like for most things machine learning, SkLearn has a tool for that.\n",
    "\n",
    "http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn.multiclass.OneVsRestClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.multiclass import OneVsRestClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "#Next let's build the model\n",
    "ova1 = OneVsRestClassifier(#Student put code here)\n",
    "ova1.fit(#Student put code here)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a quick look under the hood of the OneVsRestClassifier object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "ova1.__dict__.keys(), len(ova1.estimators_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "ova1.estimators_[0:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Now let's evaluate on test data using a confusion matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#First execute this function\n",
    "import itertools\n",
    "def plot_confusion_matrix(cm, classes,\n",
    "                          normalize=False,\n",
    "                          title='Confusion matrix',\n",
    "                          cmap=plt.cm.Blues):\n",
    "    \"\"\"\n",
    "    This function prints and plots the confusion matrix.\n",
    "    Normalization can be applied by setting `normalize=True`.\n",
    "    \"\"\"\n",
    "    acc = np.diag(cm).sum() / float(cm.sum())\n",
    "    \n",
    "    plt.imshow(cm, interpolation='nearest', cmap=cmap)\n",
    "    plt.title('{},Acc={}'.format(title, acc))\n",
    "    #plt.colorbar()\n",
    "    tick_marks = np.arange(len(classes))\n",
    "    plt.xticks(tick_marks, classes, rotation=45)\n",
    "    plt.yticks(tick_marks, classes)\n",
    "\n",
    "    if normalize:\n",
    "        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n",
    "\n",
    "    thresh = cm.max() / 2.\n",
    "    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n",
    "        plt.text(j, i, np.round(cm[i, j],3),\n",
    "                 horizontalalignment=\"center\",\n",
    "                 color=\"white\" if cm[i, j] > thresh else \"black\")\n",
    "\n",
    "    plt.tight_layout()\n",
    "    plt.ylabel('True label')\n",
    "    plt.xlabel('Predicted label')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import confusion_matrix\n",
    "\n",
    "#Next, predict the class on validation data and generate the confusion_matrix\n",
    "preds_ova = ova1.predict(#Student put code here)\n",
    "cm = #Student put code here\n",
    "\n",
    "#Now use the function above to plot it\n",
    "np.set_printoptions(precision=3)\n",
    "plt.figure(figsize = (10,10))\n",
    "plot_confusion_matrix(#Student put code here)\n",
    "\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at the above, are any of the errors a bit weird?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Visualize errors where (pred, true) in [(2,6), (8,2),(8,1)]\n",
    "\n",
    "pred_true_list = [(2,6), (8,2),(8,1)]\n",
    "'''\n",
    "Loop through the above list of tuples.\n",
    "Filter out the validation records where the prediction is the first element, and the truth is the second\n",
    "Then call the plot_vector function to visualize\n",
    "Add a title showing prediction and truth\n",
    "'''\n",
    "for pt in pred_true_list:\n",
    "    Xs = #Student put code here\n",
    "    for i in range(#Student put code here):\n",
    "        plt.figure()\n",
    "        plot_vector(#Student put code here)\n",
    "        plt.title('pred={},truth={}'.format(pt[0],pt[1]))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This accuracy is pretty good for an untuned LR on a sample of data. Can we do better with a random forest? Without running first, why might a RF do better in this case?\n",
    "\n",
    "# this should conclude Part 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## RF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "#Build a random forest with 200 trees\n",
    "rf = #Student put code here\n",
    "rf.fit(#Student put code here)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "\n",
    "#Get the discrete class predictions and a confusion matrix, and plot like was done above\n",
    "preds_rf = #Student put code here\n",
    "cm_rf = #Student put code here\n",
    "\n",
    "np.set_printoptions(precision=3)\n",
    "plt.figure(figsize = (10,10))\n",
    "plot_confusion_matrix(#Student put code here)\n",
    "\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "One interesting aspect of image is data is that we can often improve our models by using synthetic data. The key is to produce such data in a way that it doesn't deviate too far from realistic examples. We can do this by making small pertubations to the image matrix, such as adding random noise, adding blurs and by performing small rotations.\n",
    "\n",
    "We'll do that here using tools from scipy alone (note there are other image libraries that offer even more functionality). We'll then take the augmented data set and build another random forest and see if we can improve our model.\n",
    "\n",
    "First let's create a function that adds random gaussian noise, rotates the angle of the number, and applies some gaussian smoothing to create blur."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from scipy.ndimage import gaussian_filter \n",
    "#from scipy.misc import imrotate - deprecated\n",
    "\n",
    "\n",
    "def distort_image(vec, noise_std, rotation_angle, blur_std):\n",
    "    '''\n",
    "    Pass in a row of training data and distortion params\n",
    "    Return a distorted image\n",
    "    Params:\n",
    "    - vec: a row of MNIST data\n",
    "    - noise_std: parameter to control the standard deviation of random noise\n",
    "    - rotation_angle: degrees we will rotate the number\n",
    "    - blur_std: parameter to control the amount of local smoothing/blurring\n",
    "    \n",
    "    - Note: it is best to do the rotation first\n",
    "    '''\n",
    "    shape = (28,28)\n",
    "    mx = 255\n",
    "    mn = 0\n",
    "    img = np.reshape(#Student put code here)\n",
    "    #img_rot = imrotate(#Student put code here) #function is deprecated\n",
    "    img_rot = img\n",
    "    noise = np.random.normal(#Student put code here)\n",
    "    img_rot_blur = gaussian_filter(#Student put code here) + noise\n",
    "    #Adding noise can create values out of allowable range, so have to clip\n",
    "    img_rot_blur[img_rot_blur>mx] = #Student put code here\n",
    "    img_rot_blur[img_rot_blur<mn] = #Student put code here\n",
    "    return #Student put code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before building more training data, let's first plot some examples here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "vec = X_train.iloc[10].values\n",
    "\n",
    "plt.figure(figsize = (12,12))\n",
    "angles = [0, -20, 20]\n",
    "for i,noise in enumerate([0.1, 30, 60]):\n",
    "    for j,blur in enumerate([0, 1, 2]):\n",
    "        img_rot_blur = distort_image(vec, noise, angles[i], blur)\n",
    "        plt.subplot(3,3,i*3+j+1)\n",
    "        plt.imshow(img_rot_blur, cmap='gray')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So we need to choose distortation parameters that are significant enough to create a new image, but don't make it unreadable. This is a subjective call (which can be tested empirically...how?). What are some reasonable values?\n",
    "\n",
    "Our next step will be to generate another 10k training data points, and then run a second random forest.\n",
    "\n",
    "Note to students: choose your own values here, including total new and the distortation parameters. We'll execute and then see as a class what works best (in a sense we'll parallelize the exploration of the parameter space)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "n = 10000\n",
    "\n",
    "new_xs = []\n",
    "new_ys = []\n",
    "for i in range(n):\n",
    "    #Get a random example, note this samples with replacement\n",
    "    samp_i = np.random.randint(#Student put code here)\n",
    "    vec = X_train.iloc[samp_i].values\n",
    "    #Now generate some parameters, choosing them randomly\n",
    "    #For angle, sample uniformly from a particular range\n",
    "    angle = #Student put code here\n",
    "    #For noise, pick a single value\n",
    "    noise = #Student put code here\n",
    "    #For blur, sample uniformly from interval [0,1]\n",
    "    blur = #Student put code here\n",
    "    #Get a random image and convert back to \n",
    "    new_xs.append(distort_image(vec, noise, angle, blur).ravel())\n",
    "    new_ys.append(y_train.iloc[samp_i])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Now convert this fake data to DataFrames and append to original\n",
    "X_train_fake = pd.DataFrame(#Student put code here)\n",
    "y_train_fake = pd.Series(#Student put code here)\n",
    "\n",
    "#Let's not forget to throw out the features we tossed out earlier\n",
    "X_train_fake = #Student put code here\n",
    "\n",
    "#Now let's concatenate the fake data to the original\n",
    "X_train_aug = pd.concat(#Student put code here)\n",
    "y_train_aug = pd.concat(#Student put code here)\n",
    "\n",
    "X_train_aug.shape, y_train_aug.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Rerun RF, make predictions and plot confusion matrix as is done above\n",
    "rf =#Student put code here\n",
    "rf.fit(#Student put code here)\n",
    "\n",
    "preds_rf = rf.predict(#Student put code here)\n",
    "\n",
    "cm_rf = confusion_matrix(#Student put code here)\n",
    "\n",
    "np.set_printoptions(precision=3)\n",
    "plt.figure(figsize = (10,10))\n",
    "plot_confusion_matrix(#Student put code here)\n",
    "\n",
    "plt.show()\n"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
